16 research outputs found

    SiT: Self-supervised vIsion Transformer

    Self-supervised learning methods are gaining increasing traction in computer vision due to their recent success in reducing the gap with supervised learning. In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice. The recent literature suggests that transformers are becoming increasingly popular in computer vision as well. So far, vision transformers have been shown to work well when pretrained either with large-scale supervised data or with some kind of co-supervision, e.g. in the form of a teacher network. These supervised pretrained vision transformers achieve very good results on downstream tasks with minimal changes. In this work we investigate the merits of self-supervised learning for pretraining image/vision transformers and then using them for downstream classification tasks. We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms for obtaining a pretext model. The architectural flexibility of SiT allows us to use it as an autoencoder and work with multiple self-supervised tasks seamlessly. We show that a pretrained SiT can be finetuned for a downstream classification task on small-scale datasets consisting of a few thousand images rather than several million. The proposed approach is evaluated on standard datasets using common protocols. The results demonstrate the strength of transformers and their suitability for self-supervised learning. We outperform existing self-supervised learning methods by a large margin. We also observe that SiT is well suited to few-shot learning, and show that it learns useful representations by simply training a linear classifier on top of the features learned by SiT. Pretraining, finetuning, and evaluation code will be available at: https://github.com/Sara-Ahmed/SiT
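    The recipe described above, pretraining a vision transformer with a self-supervised pretext task (using it as an autoencoder) and then attaching a classifier for finetuning, can be illustrated with a minimal sketch. The toy model, corruption scheme, and hyperparameters below are illustrative assumptions, not the authors' released implementation (see the GitHub link above).

        import torch
        import torch.nn as nn

        class ToyViTAutoencoder(nn.Module):
            def __init__(self, img_size=32, patch=4, dim=192, depth=4, heads=3, num_classes=10):
                super().__init__()
                self.patch = patch
                num_patches = (img_size // patch) ** 2
                self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
                self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
                self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
                layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, depth)
                self.recon_head = nn.Linear(dim, 3 * patch * patch)  # pretext task: reconstruct image patches
                self.cls_head = nn.Linear(dim, num_classes)          # downstream: linear classifier on the class token

            def forward(self, x):
                tok = self.patch_embed(x).flatten(2).transpose(1, 2)                # (B, N, dim)
                tok = torch.cat([self.cls_token.expand(x.size(0), -1, -1), tok], dim=1)
                feats = self.encoder(tok + self.pos_embed)
                return feats[:, 0], self.recon_head(feats[:, 1:])                   # class feature, patch reconstructions

        def pretrain_step(model, imgs, optimiser):
            """One self-supervised step: corrupt the input and regress the clean patches."""
            noisy = imgs + 0.1 * torch.randn_like(imgs)                             # simple corruption as a stand-in
            target = nn.functional.unfold(imgs, model.patch, stride=model.patch).transpose(1, 2)
            _, recon = model(noisy)
            loss = nn.functional.mse_loss(recon, target)
            optimiser.zero_grad(); loss.backward(); optimiser.step()
            return loss.item()

    After pretraining, cls_head applied to the class-token feature is finetuned on the small labelled dataset, or trained alone as a linear probe.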

    LT-ViT: A Vision Transformer for multi-label Chest X-ray classification

    Vision Transformers (ViTs) are widely adopted in medical imaging tasks, and some existing efforts have been directed towards vision-language training for Chest X-rays (CXRs). However, we envision that there is still potential for improvement in vision-only training for CXRs using ViTs by aggregating information from multiple scales, which has proven beneficial for non-transformer networks. Hence, we have developed LT-ViT, a transformer that utilizes combined attention between image tokens and randomly initialized auxiliary tokens that represent labels. Our experiments demonstrate that LT-ViT (1) surpasses the state-of-the-art performance achieved with pure ViTs on two publicly available CXR datasets, (2) is generalizable to other pre-training methods and is therefore agnostic to model initialization, and (3) enables model interpretability without Grad-CAM and its variants. (Comment: 5 pages, 2 figures.)
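    The central mechanism described above, joint self-attention between image patch tokens and randomly initialized label tokens with one multi-label logit read off each label token, can be sketched as follows. This is a hedged illustration under assumed sizes, not the authors' LT-ViT code.

        import torch
        import torch.nn as nn

        class LabelTokenAttention(nn.Module):
            def __init__(self, num_labels=14, dim=384, depth=6, heads=6):
                super().__init__()
                self.label_tokens = nn.Parameter(torch.randn(1, num_labels, dim) * 0.02)  # randomly initialized auxiliary tokens
                layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, depth)
                self.head = nn.Linear(dim, 1)                              # one logit per label token

            def forward(self, patch_tokens):                               # patch_tokens: (B, N, dim) from a ViT backbone
                lbl = self.label_tokens.expand(patch_tokens.size(0), -1, -1)
                x = self.encoder(torch.cat([lbl, patch_tokens], dim=1))    # combined attention over labels and image tokens
                return self.head(x[:, : lbl.size(1)]).squeeze(-1)          # (B, num_labels) multi-label logits

        logits = LabelTokenAttention()(torch.randn(2, 196, 384))
        loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.zeros(2, 14))

    Since every label owns a token, the attention between that token and the image tokens can be read out directly, which is one plausible route to the Grad-CAM-free interpretability mentioned in point (3).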

    Multi-label networks for face attributes classification


    Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

    Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these large models has become a challenge as model sizes explode. This study evaluates the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks, obviating the need for finetuning, with the intention of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and all other patches, and a simple threshold is then applied to segment the target. A second evaluation measures intra-object and inter-object similarity to gauge the discriminatory ability of SSP ViTs. Insights from the prompt-based zero-shot segmentation and the discriminatory abilities of SSP led to the design of a simple SSP approach, termed MMC. This approach combines Masked image modelling, which encourages similarity of local features; Momentum-based self-distillation, which transfers semantics from global to local features; and global Contrast, which promotes the semantics of global features, to enhance the discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap between intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments reveal that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
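    The prompt-based zero-shot segmentation protocol is concrete enough to sketch: take the feature of the prompted patch, compute its cosine similarity to every patch feature, and threshold the resulting map. The threshold value and feature source below are illustrative assumptions, not the paper's exact settings.

        import torch
        import torch.nn.functional as F

        def segment_from_prompt(patch_feats, grid_hw, prompt_rc, threshold=0.6):
            """patch_feats: (N, D) patch features from an SSP ViT; grid_hw: (H, W) patch grid;
            prompt_rc: (row, col) of a patch lying on the target object."""
            H, W = grid_hw
            feats = F.normalize(patch_feats, dim=-1)
            query = feats[prompt_rc[0] * W + prompt_rc[1]]     # feature of the prompted patch
            sim = feats @ query                                # cosine-similarity map, shape (N,)
            mask = (sim >= threshold).reshape(H, W)            # simple thresholding segments the target
            return mask, sim.reshape(H, W)

        # Example with random features standing in for real ViT outputs:
        mask, sim = segment_from_prompt(torch.randn(14 * 14, 384), (14, 14), (7, 7))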

    Deep Convolutional Neural Network Ensembles using ECOC

    Deep neural networks have enhanced the performance of decision-making systems in many applications, including image understanding, and further gains can be achieved by constructing ensembles. However, designing an ensemble of deep networks is often not very beneficial, since the time needed to train the networks is very high or the performance gain obtained is not very significant. In this paper, we analyse the error correcting output coding (ECOC) framework as an ensemble technique for deep networks and propose different design strategies to address the accuracy-complexity trade-off. We carry out an extensive comparative study between the introduced ECOC designs and state-of-the-art ensemble techniques such as ensemble averaging and gradient boosting decision trees. Furthermore, we propose a combinatory technique which is shown to achieve the highest classification performance amongst all. (Comment: 13 pages, double-column IEEE Transactions style.)
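    To make the ECOC framework concrete, the sketch below shows the decoding step: each column of a code matrix defines one binary problem, and a test sample is assigned to the class whose codeword is closest to the vector of binary-learner outputs. The code matrix and learner outputs are placeholders; the paper's specific design strategies are not reproduced here.

        import numpy as np

        def ecoc_decode(binary_scores, code_matrix):
            """binary_scores: (B, L) outputs in [0, 1] from L binary learners;
            code_matrix: (C, L) matrix of {0, 1} codewords, one row per class."""
            # L1 distance to each codeword; with soft scores this is a weighted Hamming distance.
            dists = np.abs(binary_scores[:, None, :] - code_matrix[None, :, :]).sum(axis=2)
            return dists.argmin(axis=1)                        # predicted class index per sample

        # Example: 4 classes encoded with 6 binary learners (random dense code for illustration).
        rng = np.random.default_rng(0)
        code = rng.integers(0, 2, size=(4, 6))
        preds = ecoc_decode(rng.random((5, 6)), code)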

    Skin lesion classification with deep CNN ensembles

    Early detection of skin cancer is vital, as this is when treatment is most likely to be successful. However, diagnosis of skin lesions is a very challenging task due to the similarities between lesions in terms of appearance, location, color, and size. We present a deep learning method for skin lesion classification that fuses and fine-tunes three pre-trained deep learning architectures (Xception, Inception-ResNet-V2, and NASNetLarge) using training images provided by the ISIC2019 organizers. Additionally, the outliers and the heavy class imbalance are addressed to further enhance classification of the lesions. The experimental results show that the proposed framework obtained promising results that are comparable with the ISIC2019 challenge leaderboard.
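    A minimal sketch of the two ingredients named above, late fusion of several pretrained backbones and handling of class imbalance, follows. Torchvision ResNets stand in for the Xception, Inception-ResNet-V2, and NASNetLarge models actually used, and the class counts are illustrative rather than the real ISIC2019 distribution.

        import torch
        import torch.nn as nn
        from torchvision import models

        class FusionClassifier(nn.Module):
            def __init__(self, num_classes=8):
                super().__init__()
                # Three backbones as stand-ins; in practice each would load ImageNet weights and be fine-tuned.
                nets = [models.resnet18(), models.resnet34(), models.resnet50()]
                self.encoders = nn.ModuleList([nn.Sequential(*list(n.children())[:-1]) for n in nets])
                self.classifier = nn.Linear(512 + 512 + 2048, num_classes)     # classifier over fused pooled features

            def forward(self, x):
                feats = [enc(x).flatten(1) for enc in self.encoders]            # pooled feature vector per backbone
                return self.classifier(torch.cat(feats, dim=1))

        # Inverse-frequency class weights as one simple way to counter heavy class imbalance.
        counts = torch.tensor([450., 1280., 330., 90., 260., 25., 25., 60.])    # illustrative counts, not the real ones
        criterion = nn.CrossEntropyLoss(weight=counts.sum() / (len(counts) * counts))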

    Deep learning ensembles for image understanding

    Deep neural networks have enhanced the performance of decision-making systems in many applications, including image understanding. Further performance gains can be achieved by using ensemble methods, which are shown to be powerful tools for various classification and regression tasks. This dissertation consists of two parts. The first part is devoted to studying the face attribute classification problem. We introduce several novel approaches for this problem, achieving state-of-the-art results on the CelebA and LFWA datasets: i) we use the multi-task learning (MTL) framework for multiple-attribute classification for scalability, where base learners are grouped according to the location of the attribute on the face and share weights; providing the location of an attribute as prior information is shown to speed up the learning process and lead to increased accuracy; ii) we introduce a novel ensemble learning technique within the deep learning model itself (within-network ensemble), showing increased performance at almost the same time complexity as a single model; iii) we propose a new framework called Deep-RankSVM for relative attribute classification (comparing the expression of an attribute in two photographs), adapting the SVM formulation to deep rank learning. The second part is devoted to analyzing the suitability of different state-of-the-art design strategies for constructing ensembles of deep networks. We propose the Error Correcting Output Codes (ECOC) framework as a novel deep learning ensemble method and show that it can be used with the MTL framework for an arbitrary accuracy-complexity trade-off. We carry out an extensive comparative study between the introduced ECOC designs and state-of-the-art ensemble techniques such as ensemble averaging and gradient boosting decision trees, on several datasets. In the rest of the dissertation, we discuss general applications of the proposed ensemble techniques, including skin lesion classification and plant identification.
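    Of the techniques summarized above, the within-network ensemble is simple enough to sketch: several lightweight heads branch off one shared backbone and their predictions are averaged, approximating an ensemble at roughly the cost of a single model. The architecture and sizes are illustrative assumptions, not the dissertation's actual models.

        import torch
        import torch.nn as nn

        class WithinNetworkEnsemble(nn.Module):
            def __init__(self, feat_dim=256, num_attrs=40, num_heads=3):
                super().__init__()
                self.backbone = nn.Sequential(
                    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(64, feat_dim), nn.ReLU())
                # Ensemble members are separate heads; diversity comes from independent initialisation.
                self.heads = nn.ModuleList([nn.Linear(feat_dim, num_attrs) for _ in range(num_heads)])

            def forward(self, x):
                z = self.backbone(x)
                return torch.stack([h(z) for h in self.heads]).mean(dim=0)   # averaged attribute logits

        logits = WithinNetworkEnsemble()(torch.randn(2, 3, 64, 64))           # (2, 40)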

    Relative attribute classification with Deep-RankSVM

    Relative attributes indicate the strength of a particular attribute between image pairs. We introduce a deep Siamese network with a rank SVM loss function, called Deep-RankSVM, that can decide which of a pair of images has a stronger presence of a specific attribute. The network is trained in an end-to-end fashion to jointly learn the visual features and the ranking function. The trained network for an attribute can predict the relative strength of that attribute in novel images. We demonstrate the effectiveness of our approach against state-of-the-art methods on four image benchmark datasets: LFW-10, PubFig, UTZap50K-2, and UTZap50K-lexi. Deep-RankSVM surpasses the state-of-the-art in terms of average accuracy across attributes on three of the four benchmark datasets.
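    The pairing of a Siamese feature extractor with an SVM-style hinge loss on ranked image pairs can be sketched as below; the tiny encoder and margin value are illustrative assumptions, not the Deep-RankSVM implementation.

        import torch
        import torch.nn as nn

        class SiameseRanker(nn.Module):
            def __init__(self, feat_dim=128):
                super().__init__()
                self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
                self.scorer = nn.Linear(feat_dim, 1)           # scalar attribute-strength score

            def forward(self, img_a, img_b):                   # shared weights: a Siamese pair
                return self.scorer(self.encoder(img_a)), self.scorer(self.encoder(img_b))

        def rank_hinge_loss(score_a, score_b, label, margin=1.0):
            """label = +1 if image A shows the attribute more strongly than image B, -1 otherwise."""
            return torch.clamp(margin - label * (score_a - score_b), min=0).mean()

        model = SiameseRanker()
        sa, sb = model(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
        loss = rank_hinge_loss(sa, sb, torch.tensor([[1.], [-1.], [1.], [-1.]]))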

    Relative attributes classification via transformers and rank SVM loss

    We propose a new model for learning to rank two images with respect to their relative strength of expression of a given attribute. We address this problem, called relative attribute learning, using a vision transformer backbone. The embedded representations of the two images to be compared are extracted and fed to a ranking head, in an end-to-end fashion. The results demonstrate the strength of vision transformers and their suitability for relative attribute classification. Our proposed approach outperforms the state-of-the-art by a large margin, achieving 90.40% and 98.14% mean accuracy over the attributes of the LFW-10 and PubFig datasets, respectively.
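    A minimal sketch of this setup, under stated assumptions: a torchvision ViT stands in for the paper's vision-transformer backbone, and a linear ranking head compares the embeddings of the two images.

        import torch
        import torch.nn as nn
        from torchvision import models

        backbone = models.vit_b_16()           # pretrained weights would be loaded in practice
        backbone.heads = nn.Identity()         # expose the 768-d class-token embedding
        rank_head = nn.Linear(768, 1)          # scalar attribute-strength score per image

        def relative_score(img_a, img_b):
            """Positive output means image A expresses the attribute more strongly than image B."""
            return rank_head(backbone(img_a)) - rank_head(backbone(img_b))

        diff = relative_score(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))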